Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Mapping of Sequence Reads to the Reference Genomes ◾ 85

The VCF file can be compressed with “bgzip”, the block-compression utility distributed
with HTSlib, and the compressed file is then indexed with the tabix program, a tool for
indexing large, position-sorted bioinformatics text files. On Debian-based Linux
distributions, tabix (packaged together with bgzip) can be installed using:

sudo apt install tabix

The following commands compress the VCF file and index it:

bgzip -c SRR769545.vcf > SRR769545.vcf.gz

tabix -p vcf SRR769545.vcf.gz
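
As a quick check that compression and indexing worked, a sketch along the following lines may help. It assumes the two commands above were run in the current directory. The helper name has_index is hypothetical, and the region chr1:1-1000000 is only an illustration that assumes UCSC-style chromosome names in the VCF file.

```shell
# Return success if a bgzip-compressed file has a tabix (.tbi) or CSI (.csi)
# index stored next to it. "tabix -p vcf" writes a .tbi index by default.
has_index() {
    local vcf=$1
    [ -f "${vcf}.tbi" ] || [ -f "${vcf}.csi" ]
}

# Example use (commented out so the block can be sourced without the data):
# the indexed file can be queried by region without decompressing all of it.
# has_index SRR769545.vcf.gz && tabix SRR769545.vcf.gz chr1:1-1000000 | head
```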

The final step in the reference-guided genome assembly is to create a consensus sequence.
The reference genome sequence that was used to create the original BAM file is piped to
the “bcftools consensus” command, which uses the indexed variant call file to create a new
genome sequence for the individual studied:

cat ../ref/hg38.fa \

| bcftools consensus SRR769545.vcf.gz \

> SRR769545_genome.fasta
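
The same step can also be written with the -f (--fasta-ref) option of “bcftools consensus” instead of a pipe, followed by a simple sanity check. The following is a minimal sketch, not the book's method: the helper names count_fasta_records and build_consensus are hypothetical, and the check only confirms that the consensus has as many sequence records as the reference.

```shell
# Count the sequence records (header lines starting with ">") in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical wrapper: build the consensus with -f instead of a pipe, then
# confirm that the output has the same number of sequences as the reference.
build_consensus() {
    local ref=$1 vcf=$2 out=$3
    bcftools consensus -f "$ref" "$vcf" > "$out" || return 1
    [ "$(count_fasta_records "$ref")" -eq "$(count_fasta_records "$out")" ]
}

# Example call (commented out so the block can be sourced without the data):
# build_consensus ../ref/hg38.fa SRR769545.vcf.gz SRR769545_genome.fasta
```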

The “SRR769545_genome.fasta” file is the FASTA genome sequence that incorporates the
variants genotyped for the individual whose whole genome was sequenced. If a region of
the reference genome is not covered by the reads, no variants can be called there, so this
assembly method may miss some variants; therefore, high sequence coverage is required.
Instead of incorporating the uncovered regions of the reference genome unchanged in the
new sequence, the bases of those regions can be masked with Ns. There are several more
advanced programs for reference-guided genome assembly, such as RATT [19].
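
The masking of uncovered regions described above can be sketched as follows. This is a minimal sketch under assumptions that are not in the text: the sorted BAM file is named SRR769545.bam, bedtools is installed, and the helper names zero_coverage_bed and masked_consensus and the file name uncovered.bed are hypothetical. It relies on “bedtools genomecov -bga”, which reports depth per interval including zero-depth intervals, and on the -m (--mask) option of “bcftools consensus”, which by default replaces the listed regions with N.

```shell
# Read bedGraph lines (chrom, start, end, depth) on standard input and print,
# in BED format, only the intervals whose depth is zero.
zero_coverage_bed() {
    awk 'BEGIN { OFS = "\t" } $4 == 0 { print $1, $2, $3 }'
}

# Hypothetical wrapper: find regions with no read coverage, then build a
# consensus in which those regions are masked with N.
masked_consensus() {
    local bam=$1 ref=$2 vcf=$3 out=$4
    bedtools genomecov -ibam "$bam" -bga | zero_coverage_bed > uncovered.bed || return 1
    bcftools consensus -f "$ref" -m uncovered.bed -o "$out" "$vcf"
}

# Example call (commented out so the block can be sourced without the data;
# the BAM file name is an assumption):
# masked_consensus SRR769545.bam ../ref/hg38.fa SRR769545.vcf.gz SRR769545_masked.fasta
```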

A different approach is to assemble the genome of an organism from scratch, without
using a reference genome. This approach, called de novo assembly, is discussed in
Chapter 3.

2.6 SUMMARY

Except for the sequencing applications that use de novo genome assembly, read mapping
to a reference genome is the most fundamental step in the workflow of sequencing data
analysis. In NGS or TGS, the DNA molecules are fragmented into pieces in the library
preparation step, and the DNA libraries are then sequenced to produce the raw data in the
form of millions of reads of specific lengths. Depending on the technology used, a
high-throughput sequencing instrument produces either short reads (50–400 bp) or long
reads (>400 bp). The quality control step, which assesses and preprocesses the raw reads,
is essential to reduce base-calling errors and the biases that may arise from the presence
of technical reads. Read mapping is the most computationally expensive step in the
sequencing data analysis workflow, because the alignment program attempts to determine
the points of origin of millions or billions of reads in a reference genome. The
alignment requires even more effort for RNA-Seq read